Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Location ids table and locatable events QA check #6164

Merged
merged 8 commits into from
Jun 14, 2023
Merged

Conversation

jc-harrison
Copy link
Member

@jc-harrison jc-harrison commented Jun 13, 2023

Closes #5376
Closes #5289

I have:

  • Formatted any Python files with black
  • Brought the branch up to date with master
  • Added any relevant Github labels
  • Added tests for any new additions
  • Added or updated any relevant documentation
  • Added an Architectural Decision Record (ADR), if appropriate
  • Added an MPLv2 License Header if appropriate
  • Updated the Changelog

Description

Should be two PRs really, but the changes are both small, and vaguely related...

  • Adds a task in the FlowETL ingestion DAG to update a new table events.location_ids, which records the first and last date each location ID appeared in the CDR data (per event type).
  • Adds a new QA check "count_locatable_events", which counts the number of added rows with location ID corresponding to a cell with known location (similar to the existing "count_locatable_location_ids", but counts the number of events at locatable cells rather than the number of distinct active locatable cells).

The events.location_ids table is created (if not exists) by the FlowETL task, rather than being defined in the FlowDB provisioning scripts. This is advantageous if we're updating FlowETL in an existing FlowKit deployment, since we don't have a formal migration process so would otherwise need to remember to manually create the table, but does mean the full specification of the FlowDB schema begins to spill out beyond the flowdb sub-repo, which is perhaps not the cleanest approach. I'm not sure whether this is the right decision.

@jc-harrison jc-harrison added enhancement New feature or request FlowDB Issues related to FlowDB FlowETL labels Jun 13, 2023
@cypress
Copy link

cypress bot commented Jun 13, 2023

Passing run #20323 ↗︎

0 4 0 0 Flakiness 0

Details:

Move events.location_ids DDL into flowdb scripts
Project: FlowAuth Commit: cbec09818d
Status: Passed Duration: 00:44 💡
Started: Jun 14, 2023 11:55 AM Ended: Jun 14, 2023 11:55 AM

This comment has been generated by cypress-bot as a result of this project's GitHub integration settings.

@greenape
Copy link
Member

Maybe better to add it under the etl schema rather than events through this route? I'm also thinking if anywhere it belongs in the infrastructure schema. No objection in principle to the create occurring here, but shall we add the ddl for it to the flowdb schema as well?

@jc-harrison
Copy link
Member Author

Maybe better to add it under the etl schema rather than events through this route? I'm also thinking if anywhere it belongs in the infrastructure schema. No objection in principle to the create occurring here, but shall we add the ddl for it to the flowdb schema as well?

My reasoning for putting the table in events rather than infrastructure is that it contains information specifically about the location IDs appearing in the CDR records, irrespective of whether they correspond to actual infrastructure (i.e. if we're debugging a discrepancy between CDR and cell info, this table will be providing the information from the events side).

Could put it in etl, but it doesn't really relate to the ETL, it's just populated through the ETL process. Although if it's not actually defined in the FlowDB schema and only created on-the-fly by FlowETL then arguably it should live in the etl schema.

I think you're right, we should add the DDL to the FlowDB schema. I'm not keen on having it in the FlowDB schema and in the ETL task, because then there's a risk of the two getting out-of-sync. Perhaps it's best to have it in the FlowDB scripts only, and then if the task tries to run on a schema that's not updated it will fail. In lieu of a proper migration process, it's probably not actually very helpful to start scattering partial migration steps throughout the ETL.

@greenape
Copy link
Member

A follow on question would then be if we actually want a separate events_metadata schema maybe, but that doesn't feel required at this stage.

@codecov
Copy link

codecov bot commented Jun 14, 2023

Codecov Report

Merging #6164 (cbec098) into master (e2cf418) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #6164   +/-   ##
=======================================
  Coverage   93.24%   93.25%           
=======================================
  Files         277      278    +1     
  Lines       10828    10832    +4     
  Branches      895      895           
=======================================
+ Hits        10097    10101    +4     
  Misses        602      602           
  Partials      129      129           
Impacted Files Coverage Δ
...tl/operators/update_location_ids_table_operator.py 100.00% <100.00%> (ø)
flowetl/flowetl/flowetl/util.py 100.00% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Member

@greenape greenape left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jc-harrison jc-harrison added the ready-to-merge Label indicating a PR is OK to automerge label Jun 14, 2023
@mergify mergify bot merged commit 52a392f into master Jun 14, 2023
9 checks passed
@mergify mergify bot deleted the location-ids-table branch June 14, 2023 14:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request FlowDB Issues related to FlowDB FlowETL ready-to-merge Label indicating a PR is OK to automerge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Record first/last active date for cell IDs Locatable event count QA check
2 participants